In the last five years there has been a flurry of work on information extraction from clinical documents, i.e.,
on algorithms capable of extracting, from the informal and unstructured texts that are generated during 
everyday clinical practice, mentions of concepts relevant to such practice. Most of this literature is about 
methods based on supervised learning, i.e., methods for training an information extraction system from 
manually annotated examples. While a lot of work has been devoted to devising learning methods that 
generate more and more accurate information extractors, no work has been devoted to investigating the 
effect of the quality of training data on the learning process. Low quality in training data often derives 
from the fact that the person who has annotated the data is different from the one against whose judgment 
the automatically annotated data must be evaluated. In this paper we test the impact of such data quality 
issues on the accuracy of information extraction systems as applied to the clinical domain. We do this by 
comparing the accuracy deriving from training data annotated by the authoritative coder (i.e., the one who 
has also annotated the test data, and by whose judgment we must abide), with the accuracy deriving from 
training data annotated by a different coder. The results indicate that, although the disagreement between 
the two coders (as measured on the training set) is substantial, the difference is (surprisingly enough) not 
always statistically significant.

1. INTRODUCTION

In the last five years there has been a flurry of work (see e.g., [Kelly et al. 2014;
Pradhan et al. 2014; Sun et al. 2013; Suominen et al. 2013; Uzuner et al. 2012;
Uzuner et al. 2011]) on information extraction from clinical documents, i.e., on
algorithms capable of extracting, from the informal and unstructured texts that are
generated during everyday clinical practice (e.g., admission reports, radiological
reports, discharge summaries, clinical notes), mentions of concepts relevant to such
practice. Most of this literature is about methods based on supervised learning, i.e.,
methods for training an information extraction system from manually annotated
examples. While a lot of work has been devoted to devising text representation methods
and variants of the aforementioned supervised learning methods that generate more
and more accurate information extractors, no work has been devoted to investigating
the effects of the quality of training data on the learning process. Issues of quality in
the training data may arise for different reasons:


(1) In several organizations it is often the case that the original annotation is
performed by coders (a.k.a. “annotators”, or “assessors”) as a part of a daily routine
in which fast turnaround, rather than annotation quality, is the main goal of the
coders and/or of the organization. An example is the (increasingly frequent) case in
which annotation is performed via crowdsourcing using instruments such as, e.g.,
Mechanical Turk, CrowdFlower, etc.¹ [Grady and Lease 2010; Snow et al. 2008].

(2) In many organizations it is also the case that annotation work is usually carried
out by junior staff (e.g., interns), since having it accomplished by senior employees
would make costs soar.

(3) It is often the case that the coders entrusted with the annotation work were not
originally involved in designing the tagset (i.e., the set of concepts whose mentions
are sought in the documents). As a result, the coders may have a suboptimal
understanding of the true meaning of these concepts, or of what their mentions are
meant to look like, which may negatively affect the quality of their annotation.

(4) The data used for training the system may sometimes be old or outdated, with the
annotations not reflecting the current meaning of the concepts anymore. This is
an example of a phenomenon, called concept drift [Quiñonero-Candela et al. 2009;
Sammut and Harries 2011], which is well known in machine learning.


We may summarize all the cases mentioned above by saying that, should the training
data be independently re-annotated by an authoritative coder (hereafter indicated
as $C_\alpha$), the resulting annotations would be, to a certain extent, more reliable.
We would also be able to precisely measure this difference in reliability, by
measuring the intercoder agreement (via measures such as Cohen’s kappa - see e.g.,
[Artstein and Poesio 2008; Di Eugenio and Glass 2004]) between the training data $Tr$
as coded by $C_\alpha$ and the training data as coded by whoever else originally annotated
them (whom we will call, for simplicity, the non-authoritative coder - hereafter
indicated as $C_\beta$). In the rest of this paper we will take the authoritative coder
$C_\alpha$ to be the coder whose annotations are to be taken as correct, i.e., considered as
the “gold standard”. As a consequence we may assume that $C_\alpha$ is the coder who, once
the system is trained and deployed, also has the authority to evaluate the accuracy of
the automatic annotation (i.e., decide which annotations are correct and which are
not).² In this case, intercoder (dis)agreement measures the amount of noise that is
introduced in the training data by having them annotated by a coder $C_\beta$ different
from the authoritative coder $C_\alpha$.
It is natural to expect the accuracy of an information extraction system to be lower
if the training data have been annotated by a non-authoritative coder $C_\beta$, and higher
if they have been annotated by $C_\alpha$ herself. However, note that this is not a
consequence of the fact that $C_\alpha$ is more experienced, or senior, or reliable, than
$C_\beta$. Rather, it is a consequence of the fact that standard supervised learning
algorithms are based on the assumption that the training set and the test set are
identically and independently distributed (the so-called i.i.d. assumption), i.e., that
both sets are randomly drawn from the same distribution. As a result, these algorithms
learn to replicate the subjective annotation style of their supervisors, i.e., of those
who have annotated the training data. This means that we may expect accuracy to be
higher simply when the coder of the training set and the coder of the test set are the
same person, and to be lower when the two coders are different, irrespective of how
experienced, or senior, or reliable, they are. In other words, the very fact that a coder
is entrusted with the task of evaluating the automatic annotations (i.e., of annotating
the test set) makes this coder authoritative by definition. For this reason, the
authoritative coder $C_\alpha$ may equivalently be defined as “the coder who has annotated
the test set” (or: “the coder


¹ https://www.mturk.com/, http://crowdflower.com/

² In some organizations this authoritative coder may well be a fictional entity, e.g., several coders may be
equally experienced and thus equally authoritative. However, without loss of generality we will hereafter
assume that $C_\alpha$ exists and is unique.

whose judgments we adhere to when evaluating the accuracy of the system”), and the
non-authoritative coder $C_\beta$ may equivalently be defined as “a coder different from the
authoritative coder”.

The above arguments point to the fact that the impact of training data quality,
under its many facets discussed in items (1)-(4) above, on the accuracy of information
extraction systems may be measured by

(1) evaluating the accuracy of the system in an authoritative setting (i.e., both training
and test sets annotated by the authoritative coder $C_\alpha$), and then

(2) evaluating the loss in accuracy, with respect to the authoritative setting, that
derives from working instead in a non-authoritative setting (i.e., test set annotated
by $C_\alpha$ and training set annotated by a non-authoritative coder $C_\beta$).³

1.1. Our contribution 

In this paper we test the impact of training data quality on the accuracy of information
extraction systems as applied to the clinical domain. We do this by testing the accuracy
of two state-of-the-art systems on a dataset of radiology reports (originally discussed
in [Esuli et al. 2013]) in which a portion of the data has independently been annotated
by two different experts. In other words, we try to answer the question: “What is the
consequence of the fact that my training data are not of sterling quality, and that the
coders who produced them are not entirely dependable? How much am I going to lose in
terms of accuracy of the trained system?”

In these experiments we not only test the “pure” authoritative and non-authoritative
settings described above, but we also test partially authoritative settings, in which
increasingly large portions of the training data as annotated by $C_\alpha$ are replaced with
the corresponding portions as annotated by $C_\beta$, thus simulating the presence of
incrementally higher amounts of noise. For each setting we compute the intercoder
agreement between the two training sets; this allows us to study the relative loss in
extraction accuracy as a function of the agreement between the authoritative and the
non-authoritative coder as measured on the training set. Since in many practical
situations it is easy to compute (or estimate) the intercoder disagreement between
(a) the coder to whom we would ideally entrust the annotation task (e.g., a senior expert
in the organization), and (b) the coder to whom we can indeed entrust it given time and
cost constraints (e.g., a junior member of staff), this will give the reader a sense of
how much intercoder disagreement generates how much loss in extraction accuracy.

The rest of the paper is organized as follows. Section 2 reviews related work on
information extraction from clinical documents, and on establishing the relations between
training data quality and extraction accuracy. In Sections 3 and 4 we describe
experiments that attempt to quantify the degradation in extraction accuracy that derives
from low-quality training data, with Section 3 devoted to spelling out the experimental
setting and Section 4 devoted instead to presenting and discussing the results.
Section 5 concludes, discussing avenues for further research.


³ In the domain of classification the authoritative and non-authoritative settings have also been called
self-classification and cross-classification, respectively [Webber and Pickens 2013]. We depart from this
terminology in order to avoid any confusion with self-learning (which refers to retraining a classifier by using,
as additional training examples, examples the classifier itself has classified) and cross-lingual classification
(which denotes a variant of text classification which exploits synergies between training data expressed in
different languages).

2. RELATED WORK

2.1. Information extraction from clinical documents 


Most works on information extraction from clinical documents rely on methods
based on supervised learning, i.e., methods for training an information extraction
system from manually annotated examples. Support vector machines
(SVMs - [Jiang et al. 2011; Li et al. 2008; Sibanda et al. 2006]), hidden Markov models
(HMMs - [Li et al. 2010]), and (especially) conditional random fields (CRFs -
[Esuli et al. 2013; Gupta et al. 2014; Jiang et al. 2011; Jonnalagadda et al. 2012;
Li et al. 2008; Patrick and Li 2010; Torii et al. 2011; Wang and Patrick 200?]) have
been the learners of choice in this field, due to their good performance and to the
existence of publicly available implementations.

In recent years, research on the analysis of clinical texts has been further boosted
by the existence of “shared tasks” on this topic, such as the seminal i2b2 series
(“Informatics for Integrating Biology and the Bedside” - [Sun et al. 2013;
Uzuner et al. 2012; Uzuner et al. 2011]), the 2013 [Suominen et al. 2013] and 2014
[Kelly et al. 2014] editions of ShARe/CLEF eHealth, and the SemEval-2014 Task 7
“Analysis of Clinical Text” [Pradhan et al. 2014]. In these shared tasks the goal is to
competitively evaluate information extraction tools that recognise mentions of various
concepts of interest (e.g., mentions of diseases and disorders) as appearing in discharge
summaries, and in electrocardiogram reports, echocardiograph reports, and radiology
reports.


2.2. Low-quality training data and prediction accuracy 

The literature on the effects of suboptimal training data quality on prediction accuracy
is extremely scarce, even within the machine learning literature at large. An early
such study is [Rossin and Klein 1999], which looks at these issues in the context of
learning to predict prices of mutual funds from economic indicators. Differently from
us, the authors work with noise artificially inserted in the training set, and not with
naturally occurring noise. From experiments run with a linear regression model they
reach the bizarre conclusion that “the predictive accuracy (...) is better when errors
exist in training data than when training data are free of errors”, while the opposite
conclusion is (somewhat more expectedly) reached from experiments run with a neural
network model. A similar study, in which the context is predicting the average air
temperature in distributed heating systems, was carried out in [Jassar et al. 200?];
yet another study, in which the goal was predicting the production levels of palm oil
via a neural network, is [Khamis et al. 2005].

In the context of a biomedical information extraction task,⁴ Haddow and Alex [2008]
examined the situation in which training data annotated by two different coders are
available, and they found that higher accuracy is obtained by using both versions at
the same time than by attempting to reconcile them or using just one of them. Their
use case is different from ours, since in the case we discuss we assume that only one
set of annotations, those of the non-authoritative coder, is available as training data.
Note also that training data independently annotated by more than one coder are
rarely available in practice.

Closer to our application context, Esuli and Sebastiani [2013] have thoroughly studied
the effect of suboptimal training data quality in text classification. However, in
their case the degradation in the quality of the training data is obtained, for mere
experimental purposes, via the insertion of artificial noise, due to the fact that their
datasets did not contain data annotated by more than one coder. As a result, it is


⁴ Biomedical IE is different from clinical IE, in that the latter (unlike the former) is usually
characterized by idiosyncratic abbreviations, ungrammatical sentences, and sloppy language in general. See
[Meystre et al. 2008, p. 129] for a discussion of this point.

not clear how well the type of noise they introduce models naturally occurring noise.
Webber and Pickens [2013] also address the text classification task (in the context of
e-discovery from legal texts), but differently from [Esuli and Sebastiani 2013] they work
with naturally occurring noise; differently from the present work, the multiply-coded
training data they use were coded by one coder known to be an expert coder and
another coder known to be a junior coder. Our work instead (a) focuses on information
extraction, and (b) does not make any assumption on the relative level of expertise of
the two coders.

3. EXPERIMENTAL SETTING 

3.1. Basic notation and terminology 

Let us fix some basic notation and terminology. Let $X$ be a set of texts, where we view
each text $x \in X$ as a sequence $x = \langle x_1, \ldots, x_{|x|} \rangle$ of textual units (or simply
t-units), such that odd-numbered t-units are tokens (i.e., word occurrences) and
even-numbered t-units are separators (i.e., sequences of blanks and punctuation symbols),
and such that $x_{t_1}$ occurs before $x_{t_2}$ in the text (noted $x_{t_1} \preceq x_{t_2}$) if and only if
$t_1 \leq t_2$. We dub $|x|$ the length of the text. Let $C = \{c_1, \ldots, c_m\}$ be a predefined set of
concepts (a.k.a. tags, or markables), or tagset. We take information extraction (IE) to be
the task of determining, for each $x \in X$ and for each $c_r \in C$, a sequence
$y_r = \langle y_{r1}, \ldots, y_{r|x|} \rangle$ of labels $y_{rt} \in \{c_r, \bar{c}_r\}$, which indicates which t-units in
the text are labelled with tag $c_r$ and which are not. Since each $c_r \in C$ is dealt with
independently of the other concepts in $C$, we hereafter drop the $r$ subscript and, without
loss of generality, treat IE as the binary task of determining, given text $x$ and concept
$c$, a sequence $y = \langle y_1, \ldots, y_{|x|} \rangle$ of labels $y_t \in \{c, \bar{c}\}$.

T-units labelled with a concept $c$ usually come in coherent sequences, or “mentions”.
Hereafter, a mention $\sigma$ of text $x$ for concept $c$ will be a pair $\langle x_{t_1}, x_{t_2} \rangle$ consisting
of a start token $x_{t_1}$ and an end token $x_{t_2}$ such that (i) $x_{t_1} \preceq x_{t_2}$, (ii) all t-units
$x_t$ with $x_{t_1} \preceq x_t \preceq x_{t_2}$ are labelled with concept $c$, and (iii) the token that
immediately precedes $x_{t_1}$ and the one that immediately follows $x_{t_2}$ are not labelled
with concept $c$. In general, a text $x$ may contain zero, one, or several mentions for
concept $c$.

In the above definitions we consider separators to be also the object of tagging in 
order for the IE system to correctly identify consecutive mentions. For instance, given 
the expression “Barack Obama, Hillary Clinton” the perfect IE system will attribute 
the PersonName tag to the tokens “Barack”, “Obama”, “Hillary”, “Clinton”, and to the 
separators (in this case: blank spaces) between “Barack” and “Obama” and between 
“Hillary” and “Clinton”, but not to the separator “, ” between “Obama” and “Hillary”. 
If the IE system does so, this means that it has correctly identified the boundaries of 
the two mentions “Barack Obama” and “Hillary Clinton”.⁵
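
As an illustration of the notation above, the following Python sketch splits a text into t-units and
recovers mentions from a binary label sequence. It is not part of the experimental code: the function
names `to_t_units` and `mentions` are hypothetical, the regular expression is only one possible way of
separating tokens from separators, and labels are represented as booleans (True for "labelled with the
concept", False otherwise).

```python
import re


def to_t_units(text):
    """Split a text into alternating tokens and separators (t-units).

    Assumption: the text starts and ends with a token, so that, counting
    from 1, odd-numbered t-units are tokens and even-numbered ones are
    separators.
    """
    return re.findall(r"\w+|\W+", text)


def mentions(labels):
    """Return the (start, end) index pairs (0-based, inclusive) of the
    maximal runs of t-units labelled True."""
    spans, start = [], None
    for i, labelled in enumerate(labels):
        if labelled and start is None:
            start = i
        elif not labelled and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans


units = to_t_units("Barack Obama, Hillary Clinton")
# Every t-unit carries the tag except the separator ", ".
labels = [True, True, True, False, True, True, True]
found = [(units[s], units[e]) for s, e in mentions(labels)]
```

Because the separator between “Obama” and “Hillary” is left unlabelled, the two runs are kept apart and
`found` contains the two (start token, end token) pairs of the example.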

3.2. Dataset 

The dataset we have used to test the ideas discussed in the previous sections is the
UmbertoI(RadRep) dataset first presented in [Esuli et al. 2013], consisting of a set of
500 free-text mammography reports written (in Italian) by medical personnel of the


⁵ Note that the above notation is not able to represent “discontiguous mentions”, i.e., mentions containing
gaps, and “overlapping mentions”, i.e., multiple mentions sharing one or more tokens. This is not a serious
limitation for our research, since the above notation can be easily extended to deal with both phenomena
(e.g., by introducing unique mention identifiers and having each t-unit be associated with zero, one, or
several such identifiers), and since the dataset we use for our experimentation contains neither discontiguous
nor overlapping mentions. We prefer to keep the notation simple, since the issue we focus on in this
paper (the consequences on extraction accuracy of suboptimal training data quality) can be considered largely
independent of the expressive power of the markup language.

Table I. The distribution of annotations across concepts, at token and mention level, for each coder.

                                DEE   IES   ITE   ECH   LLO   TFU   DEP   BIR   PAE   Total
Tokens annotated by Coder1     4819  1529  7410   237  1811  1672   585   466  1723   18529
Tokens annotated by Coder2     7351  1723  7630  1329  2544  2670  1127   448  3495   24822
Mentions annotated by Coder1    204   140   190    51   164   149    19   128   344    1045
Mentions annotated by Coder2    282   145   188   102   193   171    26   103   399    1210



Istituto di Radiologia of Policlinico Umberto I, Roma, IT. The dataset is annotated
according to 9 concepts relevant to the field of radiology and mammography: BIR
(“Outcome of the BIRADS test”), ITE (“Technical Info”), IES (“Indications obtained from
the Exam”), TFU (“Followup Therapies”), DEE (“Description of Enhancement”), PAE
(“Presence/Absence of Enhancements”), ECH (“Outcomes of Surgery”), DEP (“Prosthesis
Description”), and LLO (“Locoregional Lymph Nodes”). Note that we had no control
on the design of the concept set, on its range, and on its granularity, since the choice
of the concepts was entirely under the responsibility of Policlinico Umberto I. We thus
take both the concept set and the dataset as given.

Mentions of these concepts are present in the reports according to fairly irregular 
patterns. In particular, a given concept (a) need not be instantiated in all reports, 
and (b) may be instantiated more than once (i.e., by more than one mention) in the 
same report. Mentions instantiating different concepts may overlap, and the order of 
presentation of the different concepts varies across the reports. On average, there are 
0.87 mentions for each concept in a given report, and the average mention length is 
17.33 words. 

The reports were annotated by two equally expert radiologists, Coder1 and Coder2;
191 reports were annotated by Coder1 only, 190 reports were annotated by Coder2 only,
and 119 reports were annotated independently by Coder1 and Coder2. From now on
we will call these sets 1-only, 2-only and Both, respectively; Both(1) will identify the
Both set as annotated by Coder1, and Both(2) will identify the Both set as annotated by
Coder2. The annotation activity was preceded by an alignment phase, in which Coder1
and Coder2 jointly annotated 20 reports (not included in this dataset) in order to align
their understanding of the meaning of the concepts.

Table I reports the distribution of annotations across concepts, at token and mention
level, for the two coders; see [Esuli et al. 2013, Section 4.2] for a more detailed
description of the UmbertoI(RadRep) dataset that includes additional stats.⁶


3.3. Learning algorithms 

As the learning algorithms we have tested both linear-chain conditional
random fields (LC-CRFs - [Lafferty et al. 2001; Sutton and McCallum 2007;
Sutton and McCallum 2012]), in Charles Sutton’s GRMM implementation,⁷ and
hidden Markov support vector machines (HM-SVMs - [Altun et al. 2003]), in Thorsten
Joachims’s implementation.⁸ Both are supervised learning algorithms explicitly
devised for sequence labelling, i.e., for learning to label (i.e., to annotate) items
that naturally occur in sequences and such that the label of an item may depend on the
features and/or on the labels of other items that precede or follow it in the sequence
(which is indeed the case for the tokens in a text).⁹ LC-CRFs are members of the class


⁶ No other dataset is used in this paper since we were not able to locate another dataset of annotated clinical
texts that contains reports annotated by more than one coder and is at the same time publicly available.

⁷ http://mallet.cs.umass.edu/grmm/

⁸ http://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html

⁹ Note that only tokens, and not separators, are explicitly labelled. The reason is that both LC-CRFs and
HM-SVMs actually use the so-called IOB labelling scheme, according to which, for each concept $c_r \in C$, a

of graphical models, a family of probability distributions that factorize according to
an underlying graph [Wainwright and Jordan 2008]; see [Sutton and McCallum 2012]
for a full mathematical explanation of LC-CRFs. HM-SVMs are an instantiation
of “SVMs for structured output prediction” [Tsochantaridis et al. 200?]
for the sequence labelling task, and have already been used in clinical information
extraction (see e.g., [Tang et al. 2012; Zhang et al. 2014]). In HM-SVMs the learning
procedure is based on a large-margin approach typical of SVMs, which, differently
from LC-CRFs, can learn non-linear discriminant functions via kernel functions.

Both learners need each token $x_t$ to be represented by a vector $\mathbf{x}_t$ of features.¹⁰ In
this work we have used a set of features which includes one feature representing the
word of which the token is an instance, one feature representing its stem, one feature
representing its part of speech, eight features representing its prefixes and suffixes
(the first and the last $n$ characters of the token, with $n = 1, 2, 3, 4$), one feature
representing information on token capitalization (i.e., whether the token is all
uppercase, all lowercase, first letter uppercase, or mixed case), and 4 “positional” features
[Esuli et al. 2013, Section 3.3] that indicate in which half, third, quarter, or fifth,
respectively, of the text the token occurs.
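
To make the feature set more concrete, here is a minimal Python sketch of how most of these features
could be computed for a single token. The function name `token_features`, the feature names, and the
dictionary encoding are hypothetical; the stem and part-of-speech features are deliberately left out,
since they require an external stemmer and tagger, and the exact encoding used by the two learners is
not specified here.

```python
def token_features(token, position, n_tokens):
    """Build a feature dictionary for one token (a hypothetical encoding).

    `position` is the 0-based index of the token among the `n_tokens`
    tokens of the text. Stem and part-of-speech features are omitted.
    """
    feats = {"word": token}
    # Eight affix features: first and last n characters, n = 1, 2, 3, 4.
    for n in (1, 2, 3, 4):
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    # One capitalization feature with four possible values.
    if token.isupper():
        feats["cap"] = "all_upper"
    elif token.islower():
        feats["cap"] = "all_lower"
    elif token[:1].isupper() and token[1:].islower():
        feats["cap"] = "first_upper"
    else:
        feats["cap"] = "mixed"
    # Four positional features: in which half, third, quarter, and fifth
    # of the text (numbered from 1) the token occurs.
    for k in (2, 3, 4, 5):
        feats[f"pos_{k}"] = position * k // n_tokens + 1
    return feats


example = token_features("Obama", 2, 7)
```

A dictionary of this kind would still have to be converted into whatever input format the chosen
sequence learner expects.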

3.4. Evaluation measures 

3.4.1. Classification accuracy. As a measure of classification accuracy we use,
similarly to [Esuli et al. 2013], the token-and-separator variant (proposed
in [Esuli and Sebastiani 2010]) of the well-known $F_1$ measure, according to which
an information extraction system is evaluated on an event space consisting of all
the t-units in the text. In other words, each t-unit $x_t$ (rather than each mention,
as in the traditional “segmentation F-score” model [Suzuki et al. 200?]) counts as a
true positive, true negative, false positive, or false negative for a given concept $c_r$,
depending on whether $x_t$ belongs to $c_r$ or not in the predicted annotation and in the
true annotation. This model has the advantage that it credits a system for partial
success (i.e., degree of overlap between a predicted mention and a true mention for
the same concept), and that it penalizes both overannotation and underannotation.


As is well known, $F_1$ is the harmonic mean of precision ($\pi = \frac{TP}{TP+FP}$) and recall
($\rho = \frac{TP}{TP+FN}$), and is defined as

$$F_1 = \frac{2\pi\rho}{\pi+\rho} = \frac{2\,\frac{TP}{TP+FP}\,\frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}} = \frac{2TP}{2TP+FP+FN} \qquad (1)$$

where $TP$, $FP$, and $FN$ stand for the numbers of true positives, false positives, and
false negatives, respectively. It is easy to observe that $F_1$ is equivalent to $TP$ divided
by the arithmetic mean of the actual positives and the predicted positives (or,
alternatively, the product of $\pi$ and $\rho$ divided by their arithmetic mean). Note that $F_1$ is
undefined when $TP = FP = FN = 0$; in this case we take $F_1$ to equal 1, since the
system has correctly annotated all t-units as negative.
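
The t-unit-level computation just described can be sketched in a few lines of Python. This is an
illustrative sketch only: the function name `f1_t_units` is hypothetical, and the true and predicted
annotations for one concept are assumed to be given as two equally long lists of booleans, one entry
per t-unit.

```python
def f1_t_units(true_labels, predicted_labels):
    """F1 over t-units for a single concept.

    Both arguments are equally long sequences of booleans (True means the
    t-unit is labelled with the concept). When TP = FP = FN = 0 the value
    is taken to be 1, following the convention stated in the text.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    if tp + fp + fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)


true_labels = [True, True, True, False, True, True, True]
predicted = [True, True, True, True, True, False, False]
score = f1_t_units(true_labels, predicted)  # TP = 4, FP = 1, FN = 2
```

In this toy case the predicted annotation overlaps only partially with the true mentions, and the
score (8/11) reflects the partial credit that the t-unit-level model gives.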


token can be labelled as $B_r$ (the beginning token of a mention of $c_r$), $I_r$ (a token which is inside a mention
of $c_r$ but is not its beginning token), and $O_r$ (a token that is outside any mention of $c_r$). As a result, a
separator is (implicitly) labelled with concept $c_r$ if and only if it precedes a token labelled with $I_r$. We may
think of the notation of Section 3.1 as an abstract markup language, and of the IOB notation as a concrete
markup language, in the sense that the notation of Section 3.1 is easier to understand (and will also make
the evaluation measure discussed in Section 3.4.1 easier to understand) while IOB is actually used by the
learning algorithms. The two notations are equivalent in expressive power.

¹⁰ Note that only tokens, and not separators, are explicitly represented in vectorial form, the reasons being
the same already discussed in Footnote 9.

We compute $F_1$ across the entire test set, i.e., we generate a single contingency table
by putting together all t-units in the test set, irrespective of the document they
belong to. We then compute both microaveraged $F_1$ (denoted by $F_1^{\mu}$) and macroaveraged
$F_1$ ($F_1^{M}$). $F_1^{\mu}$ is obtained by (i) computing the concept-specific values $TP_r$, $FP_r$ and
$FN_r$, (ii) obtaining $TP$ as the sum of the $TP_r$’s (same for $FP$ and $FN$), and then
(iii) applying Equation 1. $F_1^{M}$ is obtained by first computing the concept-specific $F_1$
values and then averaging them across the $c_r$’s.
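
The difference between the two averages can be illustrated with a short Python sketch. The function
names are hypothetical, the concept names `c1` and `c2` are placeholders, and the counts are made up for
the sake of the example (they are not taken from the dataset).

```python
def f1(tp, fp, fn):
    """F1 from contingency counts, taken to be 1 when TP = FP = FN = 0."""
    if tp + fp + fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)


def micro_macro_f1(counts):
    """`counts` maps each concept to its (TP, FP, FN) triple.

    Microaveraging sums the counts across concepts and then computes F1
    once; macroaveraging computes F1 per concept and then averages.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Made-up counts: a frequent, well-recognized concept and a rare, hard one.
counts = {"c1": (90, 10, 10), "c2": (5, 5, 15)}
micro, macro = micro_macro_f1(counts)
```

With these counts the microaverage (190/230, about 0.83) is dominated by the frequent concept, while
the macroaverage (about 0.62) weighs the two concepts equally.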


3.4.2. Intercoder agreement. Intercoder agreement (ICA), or the lack thereof (intercoder
disagreement), has been widely studied for over a century (see e.g., [Krippendorff 2004]
for an introduction). As a phenomenon, disagreement among coders naturally occurs
when units of content need to be annotated by humans according to their semantics
(i.e., when the occurrences of specific concepts need to be recognized within these units
of content). Such disagreement derives from the fact that semantic content is a highly
subjective notion: different coders might disagree with each other as to what the
semantics of, say, a given piece of text is, and it is even the case that the same coder
might at times disagree with herself (i.e., return different codes when coding the same
unit of content at different times).

ICA may be measured by the relative frequency of the units of content on which
coders agree, usually normalized by the probability of chance agreement. Many metrics
for ICA have been proposed over the years, “Cohen’s kappa” probably being the most
famous and widely used (“Scott’s pi” and “Krippendorff’s alpha” are others); sometimes
(see e.g., [Chapman and Dowling 2006; Esuli et al. 2013]) functions that were not
explicitly developed for measuring ICA (such as $F_1$, which was developed for measuring
binary classification accuracy) are used. The levels of ICA that are recorded in actual
experiments vary a lot across experiments, types of content, and types of concepts that
are to be recognized in the units of content under investigation. This extreme variance
depends on factors such as “annotation domain, number of categories in a coding
scheme, number of annotators in a project, whether annotators received training, the
intensity of annotator training, the annotation purpose, and the method used for the
calculation of percentage agreements” [Bayerl and Paul 2011]. The actual meaning of
the concepts the coders are asked to recognize is a factor of special importance, to the
extent that a concept on which very low levels of ICA are reached may be deemed,
because of this very fact, ill-defined.

For measuring intercoder agreement we use Cohen’s kappa (noted $\kappa$), defined as

$$\begin{aligned}
\kappa &= \frac{P(A) - P(E)}{1 - P(E)} \\
&= \frac{(P(p = t = c) + P(p = t = \bar{c})) - (P(p = c)P(t = c) + P(p = \bar{c})P(t = \bar{c}))}{1 - (P(p = c)P(t = c) + P(p = \bar{c})P(t = \bar{c}))} \\
&= \frac{\frac{TP+TN}{n} - \left(\frac{TP+FP}{n}\cdot\frac{TP+FN}{n} + \frac{FN+TN}{n}\cdot\frac{FP+TN}{n}\right)}{1 - \left(\frac{TP+FP}{n}\cdot\frac{TP+FN}{n} + \frac{FN+TN}{n}\cdot\frac{FP+TN}{n}\right)}
\end{aligned} \qquad (2)$$

where $P(A)$ denotes the probability (i.e., relative frequency) of agreement, $P(E)$
denotes the probability of chance agreement, and $n$ is the total number of examples (see
[Artstein and Poesio 2008; Di Eugenio and Glass 2004] for details); here, we use the
shorthand $p = c$ (resp., $t = c$) to mean that the predicted label (resp., true label) is $c$
(analogously for $\bar{c}$). We opt for kappa since it is the most widely known, and best
understood, measure of ICA. For Cohen’s kappa too we work at the t-unit level, i.e., for
each t-unit $x_t$ we record whether the two coders agree on whether $x_t$ is labelled or not
with the concept $c$ of interest.
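
As an illustration, Cohen’s kappa at the t-unit level can be computed from two binary annotations of
the same text as in the Python sketch below. The function name `cohen_kappa` is hypothetical, and the
treatment of the degenerate case in which chance agreement equals 1 (both coders assign the same single
label to every t-unit) is an assumption of this sketch, not something prescribed by the text.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two binary annotations of the same t-units.

    Both arguments are equally long sequences of booleans. Here `labels_a`
    plays the role of the true labels and `labels_b` that of the predicted
    labels; kappa itself is symmetric in the two coders.
    """
    pairs = list(zip(labels_a, labels_b))
    n = len(pairs)
    tp = sum(a and b for a, b in pairs)
    tn = sum((not a) and (not b) for a, b in pairs)
    fp = sum((not a) and b for a, b in pairs)
    fn = sum(a and (not b) for a, b in pairs)
    p_agree = (tp + tn) / n
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    if p_chance == 1.0:
        # Assumption: identical constant annotations count as full agreement.
        return 1.0
    return (p_agree - p_chance) / (1 - p_chance)


coder_a = [True, True, True, False, True, True, True]
coder_b = [True, True, True, True, True, False, False]
kappa = cohen_kappa(coder_a, coder_b)  # TP = 4, FP = 1, FN = 2, TN = 0
```

In this toy case the observed agreement (4/7) is lower than the chance agreement (32/49), so kappa is
negative (-4/17).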

Incidentally, note that (as observed in [Esuli and Sebastiani 2010]) we can compute
Cohen’s kappa only thanks to the fact that (as discussed in Section 3.4.1) we conduct
our evaluation at the t-unit level (rather than at the mention level). Those who conduct
their evaluation at the mention level (e.g., [Chapman and Dowling 2006]) find that they
are unable to do so, since in order to be defined kappa needs the notion of a true negative
to be also defined, and this is undefined at the mention level. Evaluation at the mention
level thus prevents the use of kappa and leaves $F_1$ as the only choice.


4. RESULTS 

4.1. Experimental protocol 

In [Esuli et al. 2013], experiments on the UmbertoI(RadRep) dataset were run using
either 1-only and/or 2-only (i.e., the portions of the data that only one coder had
annotated) as training data and Both(1) and/or Both(2) (i.e., the portion of the data
that both coders had annotated, in both versions) as test data.

In this paper we switch the roles of training set and test set, i.e., use Both(1) or
Both(2) as training set (since for the purpose of this paper we need training data with
multiple, alternative annotations) and 1-only or 2-only as test set. Specifically, we run
two batches of experiments, Batch1 and Batch2. In Batch1 Coder1 plays the role of the
authoritative coder ($C_\alpha$) and Coder2 plays the role of the non-authoritative coder
($C_\beta$), while in Batch2 Coder2 plays the role of $C_\alpha$ and Coder1 plays the role
of $C_\beta$.

Each of the two batches of experiments is composed of: 

(1) An experiment using the authoritative setting, i.e., both training and test data are
annotated by $C_\alpha$. This means training on Both(1) and testing on 1-only (Batch1)
and training on Both(2) and testing on 2-only (Batch2).

(2) An experiment using the non-authoritative setting, i.e., training data annotated by
$C_\beta$ and test data annotated by $C_\alpha$. This means training on Both(2) and testing on
1-only (Batch1) and training on Both(1) and testing on 2-only (Batch2).

(3) Experiments using the partially authoritative setting, i.e., test data annotated by
$C_\alpha$, and training data annotated in part by $C_\beta$ ($\lambda$% of the training
documents, chosen at random) and in part by $C_\alpha$ (the remaining $(100 - \lambda)$% of
the training documents). We call $\lambda$ the corruption ratio of the training set;
$\lambda = 0$ obviously corresponds to the fully authoritative setting while $\lambda = 100$
corresponds to the non-authoritative setting.

We run experiments for each $\lambda \in \{10, 20, \ldots, 80, 90\}$ by monotonically adding, for
increasing values of $\lambda$, new randomly chosen elements (10% at a time) to the
set of training documents annotated by $C_\beta$. Since the choice of training data
annotated by $C_\beta$ is random, we repeat the experiment 10 times for each value of
$\lambda \in \{10, 20, \ldots, 80, 90\}$, each time with a different such random choice.
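The nested sampling described in item (3) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used for the experiments: the function names (`corruption_schedule`, `build_training_set`), the representation of annotations as dictionaries keyed by document identifier, and the use of a seeded shuffle are all hypothetical.

```python
import random


def corruption_schedule(doc_ids, step=10, seed=0):
    """Map each corruption ratio (step, 2*step, ..., 100 - step) to the set of
    documents whose annotations are taken from the non-authoritative coder.

    The sets are nested: each one extends the previous one with newly chosen
    documents, because they are all prefixes of a single random ordering.
    """
    rng = random.Random(seed)
    order = list(doc_ids)
    rng.shuffle(order)
    schedule = {}
    for ratio in range(step, 100, step):
        cutoff = round(len(order) * ratio / 100)
        schedule[ratio] = frozenset(order[:cutoff])
    return schedule


def build_training_set(doc_ids, corrupted, ann_authoritative, ann_other):
    """Pick, for each document, the annotation of the appropriate coder."""
    return {
        d: (ann_other[d] if d in corrupted else ann_authoritative[d])
        for d in doc_ids
    }
```

Under these assumptions, the 10 repetitions of the experiment would correspond to calling `corruption_schedule` with 10 different values of `seed`.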

For each of the above train-and-test experiments we compute the intercoder agreement
$\kappa(Tr, corr_\lambda(Tr))$ between the non-corrupted version of the training set $Tr$ and the
(partially or fully) corrupted version $corr_\lambda(Tr)$ for a given value of $\lambda$. We then
take the average among the 10 values of $\kappa(Tr, corr_\lambda(Tr))$ deriving from the 10
different experiments run for a given value of $\lambda$ and denote it as $\kappa(\lambda)$; this
value indicates the average intercoder agreement that derives from “corrupting” $\lambda$% of
the documents in the training set, i.e., by using for them the annotations performed by the
non-authoritative coder.

For each of the above train-and-test experiments we also compute the extraction
accuracy (via both $F_1^{\mu}$ and $F_1^{M}$) and the relative loss in extraction accuracy that
results from the given corruption ratio.
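The formula for the relative loss is not spelled out here. The percentages reported in Table II are, however, reproduced from the rounded accuracy values in that table when the difference between the two accuracies is divided by the accuracy obtained at the given corruption ratio (rather than by the accuracy of the authoritative setting). The sketch below encodes this reading; the function name `relative_loss_pct` is hypothetical, and the choice of denominator is an inference from the table, not a definition given in the text.

```python
def relative_loss_pct(f1_authoritative: float, f1_corrupted: float) -> float:
    """Relative change in accuracy, in percent, when moving from the
    authoritative setting to a corrupted training set.

    Assumption: the denominator is the accuracy of the corrupted setting,
    which is the choice that matches the percentages in Table II.
    """
    return 100.0 * (f1_corrupted - f1_authoritative) / f1_corrupted
```

For example, for LC-CRFs in Batch1 the microaveraged values 0.783 and 0.765 give (0.765 - 0.783) / 0.765, i.e., about -2.35%.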


ACM Journal of Data and Information quality, Vol. V, No. N, Article A, Publication date: January YYYY. 













Table II. Extraction accuracy for the authoritative setting ($\lambda = 0$) and non-authoritative setting
($\lambda = 100$), for the LC-CRFs and HM-SVMs learners, for both batches of experiments (and for the
average across the two batches), along with the resulting intercoder agreement values expressed as
$\kappa(\lambda)$. Percentages indicate the loss in extraction accuracy resulting from moving from
$\lambda = 0$ to $\lambda = 100$.

Batch    $\lambda$  $\kappa(\lambda)$  LC-CRFs $F_1^{\mu}$  LC-CRFs $F_1^{M}$    HM-SVMs $F_1^{\mu}$  HM-SVMs $F_1^{M}$
Batch1   0          1.000              0.783                0.674                0.820                0.693
Batch1   100        0.742              0.765 (-2.35%)       0.668 (-0.90%)       0.786 (-4.33%)       0.688 (-0.73%)
Batch2   0          1.000              0.808                0.752                0.817                0.754
Batch2   100        0.742              0.733 (-10.23%)      0.654 (-14.98%)      0.733 (-11.46%)      0.625 (-20.64%)
Average  0          1.000              0.795                0.713                0.819                0.724
Average  100        0.742              0.749 (-6.14%)       0.661 (-7.87%)       0.760 (-7.76%)       0.657 (-10.20%)


4.2. Results and discussion 

Table II reports extraction accuracy figures for the authoritative and non-authoritative
settings, for both learners and both batches of experiments, along with the resulting
intercoder agreement values. Figure 1 illustrates the results of our experiments by
plotting $F_1$ as a function of the corruption ratio $\lambda$, using LC-CRFs and HM-SVMs as
the learning algorithm, respectively; for each value of $\lambda$, the corresponding level of
interannotator agreement $\kappa(\lambda)$ (as averaged across the two batches) is also indicated.
Figure 2 plots instead precision and recall as a function of $\lambda$ for the LC-CRFs
experiments, while Figure 3 does the same for the HM-SVMs experiments.

4.2.1. Macroaveraged values are lower than microaveraged ones. A first fact to be observed is
that macroaveraged ($F_1^{M}$) results are always lower than the corresponding
microaveraged ($F_1^{\mu}$) results. This is unsurprising, and conforms to a well-known pattern.
In fact, microaveraged effectiveness scores are heavily influenced by the accuracy obtained on
the concepts most frequent in the test set (i.e., on the ones that label many test
t-units); for these concepts accuracy tends to be higher, since these concepts also tend
to be more frequent in the training set. Conversely, in macroaveraged effectiveness
measures, each concept counts the same, which means that the low-frequency
concepts (which tend to be the low-performing ones too) have as much of an impact as
the high-frequency ones. See [Debole and Sebastiani 2005, pp. 591-593] for a thorough
discussion of this point in a text classification context.
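The difference between the two averaging methods can be made concrete with a small sketch. The per-concept counts used below are invented purely for illustration, and the convention of returning 1.0 when a concept has no true positives, false positives or false negatives is an assumption of this example.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one concept


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    # Assumed convention: nothing to find and nothing found counts as perfect.
    return 2 * tp / denom if denom else 1.0


def micro_f1(per_concept: Dict[str, Counts]) -> float:
    """F1 computed on the contingency counts summed over all concepts."""
    tp = sum(c[0] for c in per_concept.values())
    fp = sum(c[1] for c in per_concept.values())
    fn = sum(c[2] for c in per_concept.values())
    return f1(tp, fp, fn)


def macro_f1(per_concept: Dict[str, Counts]) -> float:
    """Unweighted mean of the per-concept F1 values."""
    scores = [f1(*c) for c in per_concept.values()]
    return sum(scores) / len(scores)


# Hypothetical counts: one frequent, well-learned concept and one rare one.
example = {"frequent": (900, 100, 100), "rare": (5, 5, 5)}
```

On these hypothetical counts the frequent concept has $F_1 = 0.9$ and the rare one has $F_1 = 0.5$; the macroaverage is 0.7, while the microaverage (about 0.896) is dominated by the frequent concept.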

4.2.2. HM-SVMs outperform LC-CRFs. A second fact that emerges is that HM-SVMs
generally outperform LC-CRFs, across batches, settings (authoritative and
non-authoritative), and evaluation measures ($F_1^{\mu}$ and $F_1^{M}$); e.g., on the authoritative
setting, and as an average across the two batches, HM-SVMs obtain $F_1^{\mu} = 0.819$ (while
LC-CRFs obtain 0.795) and $F_1^{M} = 0.724$ (while LC-CRFs obtain 0.713). The one exception in
Table II is the non-authoritative setting of Batch2, where the two learners obtain the same
$F_1^{\mu}$ (0.733) and LC-CRFs obtain the higher $F_1^{M}$ (0.654 vs. 0.625). Aside from their
different levels of effectiveness, the two learners behave in a qualitatively similar way
as a function of $\lambda$, as evident from a comparison of Figures 2 and 3. However, we will
not dwell on this fact any further since the relative performance of the learning
algorithms is not the main focus of the present study; as will be evident in the discussion
that follows, most insights obtained from the LC-CRFs experiments are qualitatively
confirmed by the HM-SVMs experiments, and vice versa.

4.2.3. Coder1 generates lower accuracy than Coder2. A third fact that may be noted (from
Table II) is that, for $\lambda = 0$, there is a substantive difference in accuracy values between
the two coders, with Coder2 usually generating higher accuracy than Coder1. This
fact can be especially appreciated at the macroaveraged level (where for LC-CRFs
we have $F_1^{M} = 0.674$ for Coder1 and $F_1^{M} = 0.752$ for Coder2, and for HM-SVMs we
have $F_1^{M} = 0.693$ for Coder1 and $F_1^{M} = 0.754$ for Coder2), while the difference is
less clearcut at the microaveraged level (where for LC-CRFs we have $F_1^{\mu} = 0.783$ for
Coder1 and $F_1^{\mu} = 0.808$ for Coder2, and for HM-SVMs we have $F_1^{\mu} = 0.820$ for Coder1
and $F_1^{\mu} = 0.817$ for Coder2); this indicates that the concepts on which Coder2 especially
shines are the low-frequency ones.

Fig. 1. Microaveraged $F_1$ (left) and macroaveraged $F_1$ (right) as a function of the fraction
$\lambda$ of the training set that is annotated by $C_\beta$ instead of $C_\alpha$ (“corruption ratio”),
using LC-CRFs (top) and HM-SVMs (bottom) as learning algorithms. The dashed line represents the
experiments in Batch1, the dotted line represents those in Batch2, and the solid one represents the
average between the two batches. The vertical bars indicate, for each
$\lambda \in \{10, 20, \ldots, 80, 90\}$, the standard deviation across the 10 runs deriving from the
10 random choices of $corr_\lambda(Tr)$.

In principle, there might be several reasons for this difference in accuracy values
between the two coders. The documents in 2-only might be “easier” to code
automatically than those in 1-only; or the distributions of Both(2) and 2-only might be more
similar to each other than the distributions of Both(1) and 1-only, thus satisfying the
i.i.d. assumption to a higher degree; or Coder2 might simply be more self-consistent in
her annotation style than Coder1.

In order to check whether the last of these three hypotheses is true we have
performed four $k$-fold cross-validation ($k$-FCV) experiments (for Both(1) and Both(2), and
for LC-CRFs and HM-SVMs, in all combinations), using $k = 20$. Intuitively, a higher
accuracy value resulting from a $k$-FCV test means a higher level of self-consistency,
since if the same coding style is consistently used to label a dataset, a system tends to
encounter in the testing phase the same labelling patterns it has encountered in the
training phase, which is conducive to higher accuracy. Of course, the results of such a
test are difficult to interpret if the goal is to assess the self-consistency of a coder in
absolute terms (since we do not know what values of $F_1$ correspond to what levels of
self-consistency), but they are not if the goal is simply to establish which of the two is
the more self-consistent, since the two experiments are run on the same documents.
The results of our four $k$-FCV experiments are reported in Table III. From this table we
can see that the accuracy on Both(2) is higher than the one obtained on Both(1), and
substantially so at the macroaveraged level (e.g., for LC-CRFs, $F_1^{M} = 0.771$ vs. 0.735),
thus indicating that Coder2 is indeed more self-consistent than Coder1. This
is thus the likely explanation of the higher levels of accuracy obtained on the dataset
annotated, for both training and test, by Coder2.

Fig. 2. Microaveraged (left) and macroaveraged (right) precision (top) and recall (bottom) as a
function of the fraction $\lambda$ of the training set that is annotated by $C_\beta$ instead of
$C_\alpha$ (“corruption ratio”), using LC-CRFs as a learning algorithm.

Table III. Results of the 20-fold cross-validation tests on Both(1) and Both(2), for LC-CRFs and
HM-SVMs.

Dataset  LC-CRFs $F_1^{\mu}$  LC-CRFs $F_1^{M}$    HM-SVMs $F_1^{\mu}$  HM-SVMs $F_1^{M}$
Both(1)  0.829                0.735                0.842                0.737
Both(2)  0.838                0.771                0.850                0.787
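The splitting step of a $k$-fold cross-validation test such as the one reported in Table III can be sketched as below. This is only an illustration: the function name `k_fold_splits`, the round-robin assignment of shuffled documents to folds, and the seed are assumptions, and the way in which accuracy is aggregated across folds is deliberately left out because it is not described in this section.

```python
import random


def k_fold_splits(doc_ids, k=20, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation.

    Every document appears in exactly one test fold; the remaining k - 1
    folds form the corresponding training set.
    """
    rng = random.Random(seed)
    order = list(doc_ids)
    rng.shuffle(order)
    # Round-robin assignment keeps fold sizes within one document of each other.
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test_ids = folds[i]
        train_ids = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train_ids, test_ids
```

A learner would then be trained on each `train_ids` and evaluated on the matching `test_ids`, always using the annotations of the same coder for both.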

4.2.4. Overannotation and underannotation. A fourth, even more interesting fact we may
observe is that accuracy as a function of the corruption ratio varies much less for Batch1
than for Batch2, since for the latter we witness a much more substantial drop in going
from $\lambda = 0$ to $\lambda = 100$. We conjecture that this may be due to the different annotation
style of the two coders; the rest of this subsection will be devoted to explaining the
rationale of this conjecture.

Fig. 3. Microaveraged (left) and macroaveraged (right) precision (top) and recall (bottom) as a
function of the fraction $\lambda$ of the training set that is annotated by $C_\beta$ instead of
$C_\alpha$ (“corruption ratio”), using HM-SVMs as a learning algorithm.

As evident from Table I, Coder2 annotates, as instances of the concepts of interest,
more mentions (+15.7%) and also more tokens per mention (+15.6%) than Coder1;
relative to each other, Coder1 is thus an underannotator while Coder2 is an
overannotator. Since, as noted earlier, learning algorithms learn to replicate the
subjective annotation style of their supervisors, a system trained on data annotated by
an overannotator will itself tend to overannotate; conversely, a system trained by an
underannotator will itself tend to underannotate.

Overannotation results in more true positives and more false positives. The plots in
Figures 2 and 3 show that when, as a consequence of increased values of $\lambda$, the
number of training documents annotated by an overannotator increases (as is the case for
Batch1), precision suffers somewhat (due to the fact that, along with more true
positives, there are also more false positives), but this is compensated by an increase in
recall (due to an increased number of true positives); as a result, as shown in Figure 1
(and in Table II too), the drop in $F_1$ resulting from moving from $\lambda = 0$ to $\lambda = 100$
is very limited. Figures 2 and 3 instead show that when, as a consequence of increased values
of $\lambda$, the number of training documents annotated by an underannotator increases (as
is the case for Batch2), recall drops substantially (due to the decreased number of true
positives), and this drop is not compensated by the stability of precision (which is due
to the combined effect of a decrease in true positives and a decrease in false positives);
as a result, as shown in Figure 1 (see also Table II), the drop in $F_1$ resulting from
moving from $\lambda = 0$ to $\lambda = 100$ is much more substantial than for Batch1.

Table IV. Results of the approximate randomization test, measuring the statistical significance of
the difference between the accuracy of the system trained at $\lambda = 0$ and the accuracy of the
system trained at $\lambda = 100$. Results are reported for both learners (LC-CRFs and HM-SVMs), both
batches, and both evaluation measures ($F_1^{\mu}$ and $F_1^{M}$).

Batch    LC-CRFs $F_1^{\mu}$  LC-CRFs $F_1^{M}$    HM-SVMs $F_1^{\mu}$  HM-SVMs $F_1^{M}$
Batch1   0.0859               0.6207               0.0001               0.5040
Batch2   0.0001               0.0001               0.0001               0.0001

In order to check whether the decrease in accuracy between the $\lambda = 0$ and the
$\lambda = 100$ settings is statistically significant we have performed an approximate
randomization test (ART) [Chinchor et al. 1993]. We consider the difference
statistically significant if the resulting $p$-value is below 0.05. Two advantages of the ART
are that

(1) unlike the t-test, the ART does not require the data to be normally distributed;

(2) unlike the Wilcoxon signed-rank test, the ART can be applied to multivariate
non-linear evaluation measures such as $F_1$ [Yeh 2000].
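To make the test concrete, here is a minimal sketch of a paired approximate randomization test on microaveraged $F_1$. It is an illustration under stated assumptions, not the implementation used for Table IV: we assume that the outputs of the two systems are available as per-document (TP, FP, FN) counts, that the document is the unit whose outputs are exchanged between the two systems, and that the $p$-value is estimated as $(r + 1)/(R + 1)$, where $R$ is the number of shuffles and $r$ the number of shuffles whose difference is at least as large as the observed one. The function names and the default number of shuffles are hypothetical.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]  # per-document (TP, FP, FN)


def micro_f1(counts: List[Counts]) -> float:
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumed convention for empty tables


def approximate_randomization(sys_a: List[Counts], sys_b: List[Counts],
                              shuffles: int = 9999, seed: int = 0) -> float:
    """Estimate the p-value of the observed difference in micro-F1."""
    if len(sys_a) != len(sys_b):
        raise ValueError("both systems must be evaluated on the same documents")
    rng = random.Random(seed)
    observed = abs(micro_f1(sys_a) - micro_f1(sys_b))
    at_least_as_large = 0
    for _ in range(shuffles):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            # Exchange the two systems' outputs on this document with prob. 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(micro_f1(shuf_a) - micro_f1(shuf_b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (shuffles + 1)
```

Because $F_1$ is recomputed from the pooled counts after every shuffle, the test does not need the measure to decompose into per-document scores. Under the estimator assumed here, the smallest attainable $p$-value with 9,999 shuffles is 0.0001; the number of shuffles actually used in our experiments is not stated in this section.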

The results of our statistical significance tests are reported in Table IV. These results
essentially confirm the observations above, i.e., that in Batch1 the drop in performance
resulting from having the training set annotated by the non-authoritative coder (instead
of the authoritative one) is not statistically significant (with the exception of the
$F_1^{\mu}$ results for HM-SVMs), while it is statistically significant for Batch2.

4.2.5. Caveats. The experiments discussed in this paper do not allow us to reach hard
conclusions about the robustness of information extraction systems to imperfect
training data quality, for several reasons:

(1) The results obtained should be confirmed by additional experiments carried out
on other datasets; unfortunately, as noted in a footnote above, we were not able to
locate any other publicly available dataset with the required characteristics (that is,
containing at least some doubly annotated documents).

(2) The dataset used here is representative of only a specific type of imperfect training 
data quality, i.e., the one deriving from the fact that the training data were anno¬ 
tated by a coder different (albeit equally expert) from the one who annotated the 
test set. Other types do exist, however, as noted in the introduction. 

(3) Even the results reported here are somewhat contradictory, since a statistically
significant drop in performance was observed in Batch2 while (with one exception) no such
statistically significant drop was observed in Batch1.

However, one interesting fact that has emerged from the present study (and that will
need to be confirmed by additional experiments, should other datasets become
available) is that, as argued in detail in Section 4.2.4, the lack of a statistically significant
drop in performance observed in Batch1 seems to be due to the fact that the
non-authoritative coder who annotated the training set had an overannotating behaviour.
This might suggest (and prudence should be exercised here) that, should
there be a need for having a training set annotated by someone different from the
authoritative coder, underannotation should be discouraged much more than
overannotation.

5. CONCLUSIONS 

Few researchers have investigated the loss in accuracy that occurs when a supervised
learning algorithm is fed with training data of suboptimal quality. We have done this
for the first time in the case of information extraction systems (trained via supervised
learning) as applied to the detection of mentions of concepts of interest in medical
notes. Specifically, we have tested to what extent extraction accuracy suffers when the
person who has annotated the test data (the “authoritative coder”), whom we must
assume to be the person to whose judgment we conform irrespective of her level of
expertise, is different from the person who has labelled the training data (the
“non-authoritative coder”). Our experimental results, which we have obtained on a dataset of
500 mammography reports annotated according to 9 concepts of interest, are somewhat
surprising, since they indicate that the resulting drop in accuracy is not always
statistically significant. In our experiments, no statistically significant drop was observed
(with one exception, the $F_1^{\mu}$ results for HM-SVMs) when the non-authoritative coder had a
tendency to overannotate, while a substantial, statistically significant drop was observed
when the non-authoritative coder was an underannotator; however, experiments on more doubly
(or even multiply) annotated datasets will be needed to confirm or disconfirm these initial
findings. Since labelling cost is an important issue in the generation of training data (with
senior coders costing much more than junior ones, and with internal coders costing much more
than “mechanical turkers”), results of this kind may give important indications as to the
cost-effectiveness of low-cost annotation work.

This paper is a first attempt to investigate the impact of less-than-sterling training
data quality on the accuracy of medical concept extraction systems, and more work
is needed to validate the conjectures that we have made based on our experimental
results. As repeatedly mentioned in this paper, one limitation of the present work is the
fact that only one dataset was used for the experiments. This was due to the
unfortunate lack of other publicly available medical datasets that contain (at least a subset
of) textual records independently labelled by two different coders; datasets with these
characteristics have been used in the past in published research but are not made
available to the rest of the scientific community. We hope that the increasing
importance of text mining applications in clinical practice, and the importance of shared
datasets for fostering advances in this field, will generate a new kind of awareness of
the need to make more datasets available to the scientific community.